AITopics | videoqa benchmark

Collaborating Authors

videoqa benchmark

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding Xinyu Fang

Neural Information Processing SystemsFeb-17-2026, 02:45:14 GMT

The advent of large vision-language models (L VLMs) has spurred research into their applications in multi-modal contexts, particularly in video understanding.

large language model, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

Asia > China > Shanghai > Shanghai (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report > New Finding (0.67)

Industry:

Law (0.67)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Vision > Video Understanding (0.72)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)

Add feedback

ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition

Neural Information Processing SystemsDec-27-2025, 14:02:03 GMT

Our world is full of varied actions and moves in specialized fields that we, as humans, seek to identify and learn about. To evaluate the effectiveness of multi-modal models in helping us recognize such fine-grained actions, we introduce ActionAtlas, a video question answering (VideoQA) benchmark on fine-grained action recognition with short videos across various sports. ActionAtlas contains 554 videos spanning 284 actions across 42 sports with 1161 actions as total potential choices. Unlike most existing action recognition benchmarks that focus on simplistic actions, often identifiable from a single frame, ActionAtlas focuses on intricate movements and tests the models' ability to discern subtle differences. Additionally, each video in ActionAtlas also includes a question, which helps to more accurately pinpoint the action's performer in scenarios where multiple individuals are involved in different activities. We evaluate proprietary and open models on this benchmark and show that the state-of-the-art models only perform at most 48.73% accurately where random chance is 20%. Furthermore, our results show that a high frame sampling rate is essential for recognizing actions in ActionAtlas, a feature that current top proprietary models like Gemini lack in their default settings.

actionatla, domain-specialized action recognition, videoqa benchmark, (2 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.60)

Technology: Information Technology > Artificial Intelligence (1.00)

Add feedback

MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding Xinyu Fang

Neural Information Processing SystemsOct-10-2025, 11:52:39 GMT

The advent of large vision-language models (L VLMs) has spurred research into their applications in multi-modal contexts, particularly in video understanding.

benchmark, dataset, mmbench-video, (14 more...)

Neural Information Processing Systems

Country:

Asia > China > Shanghai > Shanghai (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report > New Finding (0.67)

Industry:

Law (0.67)
Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Vision > Video Understanding (0.72)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)

Add feedback

ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition

Neural Information Processing SystemsMay-27-2025, 21:35:03 GMT

actionatla, artificial intelligence, domain-specialized action recognition, (1 more...)

Neural Information Processing Systems

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Vision (0.89)

Add feedback

BoViLA: Bootstrapping Video-Language Alignment via LLM-Based Self-Questioning and Answering

Chen, Jin, Ma, Kaijing, Huang, Haojian, Shen, Jiayu, Fang, Han, Zang, Xianghao, Ban, Chao, He, Zhongjiang, Sun, Hao, Kang, Yanmei

arXiv.org Artificial IntelligenceSep-17-2024

The development of multi-modal models has been rapidly advancing, with some demonstrating remarkable capabilities. However, annotating video-text pairs remains expensive and insufficient. Take video question answering (VideoQA) tasks as an example, human annotated questions and answers often cover only part of the video, and similar semantics can also be expressed through different text forms, leading to underutilization of video. To address this, we propose BoViLA, a self-training framework that augments question samples during training through LLM-based self-questioning and answering, which help model exploit video information and the internal knowledge of LLMs more thoroughly to improve modality alignment. To filter bad self-generated questions, we introduce Evidential Deep Learning (EDL) to estimate uncertainty and assess the quality of self-generated questions by evaluating the modality alignment within the context. To the best of our knowledge, this work is the first to explore LLM-based self-training frameworks for modality alignment. We evaluate BoViLA on five strong VideoQA benchmarks, where it outperforms several state-of-the-art methods and demonstrate its effectiveness and generality. Additionally, we provide extensive analyses of the self-training framework and the EDL-based uncertainty filtering mechanism. The code will be made available at https://github.com/dunknsabsw/BoViLA.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2410.02768

Country:

Asia > China > Shaanxi Province > Xi'an (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report > Promising Solution (0.34)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.41)

Add feedback

ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos

Yu, Zhou, Zheng, Lixiang, Zhao, Zhou, Wu, Fei, Fan, Jianping, Ren, Kui, Yu, Jun

arXiv.org Artificial IntelligenceMay-3-2023

Building benchmarks to systemically analyze different capabilities of video question answering (VideoQA) models is challenging yet crucial. Existing benchmarks often use non-compositional simple questions and suffer from language biases, making it difficult to diagnose model weaknesses incisively. A recent benchmark AGQA poses a promising paradigm to generate QA pairs automatically from pre-annotated scene graphs, enabling it to measure diverse reasoning abilities with granular control. However, its questions have limitations in reasoning about the fine-grained semantics in videos as such information is absent in its scene graphs. To this end, we present ANetQA, a large-scale benchmark that supports fine-grained compositional reasoning over the challenging untrimmed videos from ActivityNet. Similar to AGQA, the QA pairs in ANetQA are automatically generated from annotated video scene graphs. The fine-grained properties of ANetQA are reflected in the following: (i) untrimmed videos with fine-grained semantics; (ii) spatio-temporal scene graphs with fine-grained taxonomies; and (iii) diverse questions generated from fine-grained templates. ANetQA attains 1.4 billion unbalanced and 13.4 million balanced QA pairs, which is an order of magnitude larger than AGQA with a similar number of videos. Comprehensive experiments are performed for state-of-the-art methods. The best model achieves 44.5% accuracy while human performance tops out at 84.5%, leaving sufficient room for improvement.

benchmark, natural language, question answering, (19 more...)

arXiv.org Artificial Intelligence

2305.02519

Country: Asia > China > Zhejiang Province > Hangzhou (0.04)

Genre: Research Report > Promising Solution (0.34)

Industry: Leisure & Entertainment > Sports (0.93)

Technology: Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)

Add feedback